In [1]:
import musicntd.scripts.hide_code as hide

Autosimilarity or barwise normalized autosimilarity?

The first notebook presented our method for segmenting autosimilarity matrices.

Should autosimilarities be normalized barwise?

In our opinion, yes. When the autosimilarity matrix is normalized barwise, every diagonal entry (the similarity of a bar with itself) equals one, which is a natural value for a self-similarity. Normalizing also reduces discrepancies in intensity between bars, which is desirable when comparing them.

Accordingly, all autosimilarities from here on are normalized barwise, even when this is not stated explicitly.
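To make the effect of the normalization concrete, here is a minimal sketch in NumPy. It assumes the bars are stored as the columns of a feature matrix and that "normalizing barwise" means scaling each column to unit Euclidean norm. The function name and the toy data are hypothetical, and the guard against zero-norm columns is an addition of this sketch, not necessarily what the musicntd code does.

```python
import numpy as np

def barwise_normalized_autosimilarity(features):
    """Autosimilarity of `features` (one column per bar) after scaling
    every column to unit Euclidean norm."""
    features = np.asarray(features, dtype=float)
    norms = np.linalg.norm(features, axis=0)
    # Guard against all-zero (e.g. silent) bars, which would otherwise
    # cause a division by zero; such columns are simply left at zero.
    safe_norms = np.where(norms > 0, norms, 1.0)
    normalized = features / safe_norms
    return normalized.T @ normalized

# Toy data: 3 features x 4 bars. Bar 2 is a louder copy of bar 0,
# and bar 3 is silent.
bars = np.array([[1.0, 0.0, 3.0, 0.0],
                 [2.0, 1.0, 6.0, 0.0],
                 [0.0, 1.0, 0.0, 0.0]])
autosimilarity = barwise_normalized_autosimilarity(bars)
```

In this toy example the diagonal entries of the non-silent bars equal one, and bars 0 and 2 have similarity one although their intensities differ, which is the behaviour argued for above.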

How to handle the chroma information?

Classically, the factor matrix $W$ would be treated like the other factor matrices and optimized by the algorithm.

In this context, however, $W$ carries the chroma information, and each chroma represents a pitch class of the Western tonal scale. We can assume that, in general, two tones are never perfectly correlated over a song, even though this may be false for particular examples. Under this assumption, the ideal decomposition of this mode would have 12 perfectly decorrelated columns, that is, a matrix in which each column represents exactly one chroma: the identity matrix, up to a column permutation.

Even if some tones are correlated, fixing $W$ to the identity matrix should not be a problem, since they can still be mixed within the musical patterns of the core tensor.

Fixing $W$ to the identity matrix (denoted Id12 in the figures) could therefore reduce the complexity of the NTD algorithm without losing compression capacity.

This assumption still needs to be validated, which is the goal of the experiments below.
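The algebra behind this choice can be checked numerically. The sketch below assumes a third-order Tucker model whose first mode is the chroma mode; the tensor dimensions, the small ranks and the helper `tucker_to_tensor` are hypothetical and chosen only for illustration. It shows that with $W$ equal to the identity the chroma mode of the core passes through unchanged, and that a column permutation of the identity can be absorbed by reordering the core.

```python
import numpy as np

def tucker_to_tensor(core, W, H, Q):
    """Reconstruct a third-order tensor from a core and three factor matrices."""
    return np.einsum("abc,ia,jb,kc->ijk", core, W, H, Q)

rng = np.random.default_rng(0)
n_chromas, n_frames, n_bars = 12, 20, 8   # hypothetical tensor dimensions
rank_h, rank_q = 4, 3                      # hypothetical small ranks

core = rng.random((n_chromas, rank_h, rank_q))
H = rng.random((n_frames, rank_h))
Q = rng.random((n_bars, rank_q))

# With W fixed to the identity, only H and Q act on the core.
with_identity = tucker_to_tensor(core, np.eye(n_chromas), H, Q)
without_w = np.einsum("abc,jb,kc->ajk", core, H, Q)

# A column permutation of the identity is absorbed by permuting the
# chroma slices of the core in the same order.
perm = rng.permutation(n_chromas)
W_perm = np.eye(n_chromas)[:, perm]
with_permutation = tucker_to_tensor(core[perm], W_perm, H, Q)
```

Both `without_w` and `with_permutation` match `with_identity`, so under this model only the core, $H$ and $Q$ remain to be optimized once $W$ is fixed.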

Comparison on an example: "1.wav" from RWC Pop

Below are the factor matrices resulting from NTD when $W$ is fixed to the identity (left) and when $W$ is optimized like every other mode (right).

In [2]:
ranks = [12,32,32]
hide.compare_chromas_tucker_decomp("1", ranks)
c:\users\amarmore\desktop\projects\phd main projects\on git\code\tensor factorization\musicntd\autosimilarity_segmentation.py:43: RuntimeWarning: invalid value encountered in true_divide
  this_array = np.array([list(i/np.linalg.norm(i)) for i in this_array.T]).T

Computation on the entire RWC with several ranks

To decide between fixing $W$ and optimizing it like the other modes, we compare segmentation results on the entire RWC Pop dataset, with different ranks for $H$ and $Q$ (16 and 32).

In [3]:
ranks_rhythm = [16,32]
ranks_pattern = [16,32]
annotations_type = "MIREX10"

The results are presented below, with tolerance windows of 0.5 seconds and 3 seconds respectively.
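For readers unfamiliar with tolerance windows, the following sketch shows one way the table columns (true positives, false positives, false negatives, precision, recall, F-measure) can be computed for a single song. The function `boundary_scores` and the boundary times are hypothetical, and the greedy one-to-one matching used here is a simplification: it is not necessarily the matching procedure used to produce the tables in this notebook.

```python
def boundary_scores(reference, estimated, tolerance):
    """Precision, recall and F-measure of estimated boundaries (in seconds).

    An estimate counts as a hit if it lies within `tolerance` seconds of a
    reference boundary that has not been matched yet."""
    unmatched = sorted(reference)
    true_positives = 0
    for estimate in sorted(estimated):
        # Greedy matching: take the closest still-unmatched reference boundary.
        candidates = [r for r in unmatched if abs(r - estimate) <= tolerance]
        if candidates:
            unmatched.remove(min(candidates, key=lambda r: abs(r - estimate)))
            true_positives += 1
    false_positives = len(estimated) - true_positives
    false_negatives = len(reference) - true_positives
    precision = true_positives / len(estimated) if estimated else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall > 0 else 0.0)
    return (true_positives, false_positives, false_negatives,
            precision, recall, f_measure)

# Hypothetical boundaries, in seconds.
reference = [0.0, 10.0, 20.0, 30.0]
estimated = [0.2, 11.5, 20.4, 26.0, 30.1]
strict = boundary_scores(reference, estimated, tolerance=0.5)
loose = boundary_scores(reference, estimated, tolerance=3.0)
```

With these toy boundaries, the estimate at 11.5 s is a miss at 0.5 seconds but a hit at 3 seconds, which illustrates why scores are higher with the larger window.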

Fixing $W$ to Id12

In [4]:
zero_five_chr, three_chr = hide.compute_ranks_RWC(ranks_rhythm,ranks_pattern, W = "chromas", annotations_type = annotations_type, penalty_weight = 0)
Results at 0.5 seconds
Rank Q  Rank H  True Positives  False Positives  False Negatives  Precision  Recall  F-measure
16      16      9.2500          9.3600           9.5600           0.5122     0.4981  0.4941
16      32      9.4000          9.3800           9.4100           0.5132     0.5064  0.4999
32      16      10.9000         11.5400          7.9100           0.4937     0.5858  0.5264
32      32      10.9400         11.7800          7.8700           0.4876     0.5922  0.5259

Results at 3 seconds
Rank Q  Rank H  True Positives  False Positives  False Negatives  Precision  Recall  F-measure
16      16      11.9000         6.7100           6.9100           0.6633     0.6402  0.6374
16      32      11.9900         6.7900           6.8200           0.6575     0.6445  0.6381
32      16      13.3400         9.1000           5.4700           0.6094     0.7165  0.6464
32      32      13.3700         9.3500           5.4400           0.6010     0.7197  0.6433

Optimizing $W$ with the other modes

In [5]:
zero_five_tk, three_tk = hide.compute_ranks_RWC(ranks_rhythm,ranks_pattern, W = "tucker", annotations_type = annotations_type, penalty_weight = 0)
Results at 0.5 seconds
Rank Q  Rank H  True Positives  False Positives  False Negatives  Precision  Recall  F-measure
16      16      9.0700          9.9600           9.7400           0.4949     0.4878  0.4782
16      32      9.1300          10.5800          9.6800           0.4803     0.4951  0.4758
32      16      10.5100         11.8900          8.3000           0.4845     0.5647  0.5105
32      32      10.7000         12.4800          8.1100           0.4752     0.5732  0.5097

Results at 3 seconds
Rank Q  Rank H  True Positives  False Positives  False Negatives  Precision  Recall  F-measure
16      16      11.6100         7.4200           7.2000           0.6369     0.6251  0.6144
16      32      11.6600         8.0500           7.1500           0.6179     0.6299  0.6089
32      16      12.8400         9.5600           5.9700           0.6000     0.6904  0.6282
32      32      13.2100         9.9700           5.6000           0.5896     0.7073  0.6311

Note: the original computation was lost, so these results were recomputed with the latest version of the algorithm at the time of writing, which explains why they may differ from those obtained in the first notebook.

Conclusion

  • As previously stated, we decided to normalize our autosimilarities barwise.
  • Regarding $W$, we decided to fix it to the 12×12 identity matrix. In addition to a small gain in segmentation scores (as shown in the tables above), fixing $W$ to the identity reduces complexity and computation time, and yields more interpretable results, since each column of $W$ is guaranteed to represent one and only one semitone.